Intracellular infections of epithelial cells are hypothesized to produce expression of gene pathways related to immune response, infection response, and inflammation. To test this hypothesis, Salmonella Typhimurium and the Hela-229 cell line were chosen. Salmonella are intracellular bacteria, which infect and replicate within both epithelial cells and macrophages, causing acute gastroenteritis in humans. The HeLa-229 human cell line is epithelial in origin, so will constitute an acceptable host for Salmonella.
This experiment will test this hypothesis using a DESeq2 analysis, with a two-group experimental design.
An DESeq2 analysis was conducted on data comparing gene expression in HeLa-229 cells infected with Salmonella Typhimurium with a control group. The analysis was performed with the DESeq2 tool, using the R package of the same name.
| Title: | Transcriptomic profiling of HeLa-229 cells infected with Salmonella Typhimurium |
| Accession Number: | SRP034009 [https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP034009] |
| BioProject: | PRJNA231559 [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA231559] |
| File URL | [http://duffel.rail.bio/recount/v2/SRP034009/rse_gene.Rdata] |
The data set was a two-group experimental design with three replicates in each group. One group was the Salmonella infected cells, the second, the control group.
The DESeq2 R package was used to perform the Differential Gene Expression Analysis (DGE) and the clusterProfiler R package for the Gene Set Enrichment Analysis (GSEA).
A Principal Component Analysis (PCA) was performed on the data which showed that first 2 components explained 95.4% of the variance.
The component plots show strong differentiation on the PC1 axis for the control and condition groups, with tight clustering for the control replicates, and Salmonella replicates split into two clusters, with Salmonella Replicate 1 differing from the other 2 Salmonella replicates. The heatmap shows a different pattern of of expression for Salmonella replicate and this is explorer further in section.
The DESeq2 analysis was generated and a MA plot generated, which showed decreasing variability with increased counts, and an approximately equal proportion of over and under expressed genes.
The table below shows all the significantly differentially expressed genes.
The top most significantly expressed genes included CCL2, IL6, CXCL2, CTSL and all of these are involved in inflammation or immune response pathways. Since they are differentially expressed, the presence of Salmonella is a possible cause. The underlying presence of HPV-18 in the cell line may also explain some of this expression1, but this cannot be determined with the experiment design.
Overall the plot shows conclusively that significant over and under expression occurred, with somewhat more over-expression than under-expression.
Evaluating the top genes together as a group using Enrichr 2 demonstrated that they are correlated with pathways such as Endogenous Toll-Like Receptor signalling, involved in infection responses and IL-17 Signalling, involved in inflammation responses. These pathways were detected in the GSEA analysis as well.
A Gene Set Enrichment Analysis (GSEA) was performed, in which some pathways 3 were related to infection and immunologic specific pathways, specifically:
Some pathways were suggestive of tumor expression, such as GOBP_EPITHELIAL_CELL_PROLIFERATION. The presence of these may be confounding factors related to the HeLa cell line or HPV-18.
Overall, the most significant pathways had very close enrichment scores (within 15% of each other, for the top 100 pathways), with many pathways not obviously related to the experiment. Therefore, no obvious trend was revealed from the GSEA analysis, and support for the hypothesis was weak.
The peak enrichment scores for the top 5 plots were very similar, although most of the pathways did not directly relate to the hypothesis being explored.
The significant pathways are shown in the table below. All have very close enrichment scores - from 2.4 to 2.06 for the first 100 pathways. So the the most significant pathways are too similarly expressed to make meaningful distinctions between them.
The GSEA analysis showed several pathway expression relating to inflammation, immune and bacterial infection response, along with many other pathways with similar enrichment scores. Overall, there was limited support for the hypothesis.
There was expression of some pathways potentially related to oncogenesis, which could possibly be related to the cell line. There is previous evidence to suggest highly aberrant expression of some pathways in this cell line 4. Also, the cell line contains Human Papilloma Virus 18 DNA, known to cause the majority of cervical neoplasms and also implicated in the cell line’s characteristic immortality. There is evidence of HPV-18 mRNA expression in HeLa cell lines5. These are all confounding factors which warrant further investigation.
The value of investigating individual gene expression with heatmaps or volcano plots for evaluating complex pathways appears limited, as gene expression and functional roles are heavily context dependent. The GSEA analysis was found to be slightly more effective in discovering the cellular responses to intracellular bacterial infection.
Given the nature of the cell-line used in this experiment, future experiments could be made with other, non-neoplastic epithelial cell lines, to try and eliminate the confounding problems of aberrant tumor cell expression and viral infection. Another dimension would be to use other intracellular bacteria to see how pathway expression differs across bacterial species. It is also unclear, at what stage of the infection the samples were taken, as this may affect gene expression, as well as the natural variation in infection life-cycles between individual cells.
From a methodological perspective, it is recommended to perform RNA-seq analyses with at least two methods, from toolsets such as DESeq2, EdgeR and Limma-Voom.
A review of the DEseq data found a large number of read count duplicates (87.6% of counts has at least 1 duplicate), which gave warnings in the GSEA analysis. Duplicate reads are common in RNA-seq data sets, though the high degree of duplicate may be a cause for concern and may be caused by technical sequencing issues or a high abundance of a small number of genes.
After consideration, no action was taken to remove the duplicate values.
The table below shows duplicates of two or more counts in the raw assay data.
# assay count duplicates
assay_df <- data.frame(assay(rse_gene))
assay_df<- assay_df %>% arrange(desc(SRR1049363))
dupsCounts <- assay_df %>% group_by(SRR1049363) %>% filter(n() > 1)
head(dupsCounts, n = 12)
## # A tibble: 12 × 6
## # Groups: SRR1049363 [6]
## SRR1049363 SRR1049364 SRR1049365 SRR1049366 SRR1049367 SRR1049368
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 54302 45609 40415 21281 23656 29219
## 2 54302 61654 61103 10225 7706 9846
## 3 53022 55343 51375 34511 34428 41473
## 4 53022 58643 56965 24197 26196 31438
## 5 50472 58848 55352 55674 52378 64884
## 6 50472 52794 53197 30063 30721 38816
## 7 49502 48087 45419 50665 45347 54503
## 8 49502 55923 59935 19846 18815 21361
## 9 46223 49472 47237 13190 14856 17526
## 10 46223 54347 51152 29966 30935 38122
## 11 44507 55588 53645 48451 48295 56803
## 12 44507 52747 53455 16894 16916 20433
duplicated_n <- nrow(dupsCounts)
duplicated_percentage <- round(duplicated_n/nrow(assay_df) * 100, 1)
print(duplicated_percentage)
## [1] 87.6
The initial DESeq2 analysis discovered a high proportion of outliers (13%). The data contains 2 groups of 3 replicates, and the DESeq analysis contains a parameter ‘minReplicatesForReplace’ with a default value of 7, which is ‘the minimum number of replicates required in order to use replaceOutliers on a sample’, so for this data set, no outlier replacement was performed.
out of 30230 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 5555, 18%
LFC < 0 (down) : 5050, 17%
outliers [1] : 3820, 13%
low counts [2] : 2187, 7.2%
(mean count < 9)
A Box plot of the Cooks distances shows similar distributions of outliers for all replicates, so no single replicate stands out as a source of technical error. One replicate, SRR1049367 (Salmonella 2), did contain a lot of small distances, but this is not of significance.
The DESeq2 analysis was repeated with ‘minReplicatesForReplace’ set to the minimum value of 3, which allows outliers in a single control group to be replaced.
This configuration eliminated the outliers by replacement.
summary(dds_res)
##
## out of 25029 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 5426, 22%
## LFC < 0 (down) : 4909, 20%
## outliers [1] : 0, 0%
## low counts [2] : 1456, 5.8%
## (mean count < 16)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
A review of the most extreme raw counts shows that the very high log2 fold changes are non-zero values in a single replicate group, with zero values in the other group. These values appear valid as the non-zero counts are relatively consistent between replicates, and the zero values merely represent a lower than ideal sequencing depth.
library(tibble)
raw_counts <- counts(dds)
raw_counts <- data.frame(raw_counts) %>%
rownames_to_column("GENEID") %>%
mutate(GENEID=sub('\\..+', "", GENEID))
# possible outliers in this data set
dds_res_max_values <- dds_res_df %>%
dplyr::filter(abs(log2FoldChange) > 10) %>%
arrange(desc(log2FoldChange))
# this is raw counts
inner_join(x=raw_counts,
y=dds_res_max_values,
by="GENEID") %>%
dplyr::select(GENEID, symbol, log2FoldChange, c(2:7)) %>%
arrange(desc(log2FoldChange)) %>%
dplyr::filter(row_number() <= TOP_PATHWAYS | row_number()> n()- TOP_PATHWAYS) %>%
dplyr::rename("Ctl1" = "SRR1049363", "Ctl2" = "SRR1049364", "Ctl3" = "SRR1049365",
"Salm1" = "SRR1049366", "Salm2" = "SRR1049367", "Salm3" = "SRR1049368")
## GENEID symbol log2FoldChange Ctl1 Ctl2 Ctl3 Salm1 Salm2 Salm3
## 1 ENSG00000145864 GABRB2 30.00000 0 0 0 0 75264 51774
## 2 ENSG00000057149 SERPINB3 15.97002 0 0 0 7172 7578 7139
## 3 ENSG00000171557 FGG 15.55907 0 0 0 5571 5609 5271
## 4 ENSG00000206073 SERPINB4 15.50929 0 0 0 6036 4874 4973
## 5 ENSG00000108700 CCL8 14.79386 0 0 0 3339 3081 3287
## 6 ENSG00000126838 PZP -11.13688 605 542 858 0 0 0
## 7 ENSG00000141639 MAPK4 -11.22592 909 511 688 0 0 0
## 8 ENSG00000185518 SV2B -11.53802 539 1113 1025 0 0 0
## 9 ENSG00000232537 RP11-385M4.1 -11.54841 721 832 1123 0 0 0
## 10 ENSG00000164120 HPGD -11.79331 912 1502 760 0 0 0
Some counts appeared in only a single replicate and were highly significant - these were considered technical artefacts that were removed. Many of these counts had duplicated normalized counts, and a sample of these is shown below.
All of these were removed from the data set.
head(sig_norm_counts)
SRR1049363 SRR1049364 SRR1049365 SRR1049366 SRR1049367 SRR1049368
ENSG00000145864.12 0 0 0 0 99913.64158 58646.4
ENSG00000177770.11 0 0 0 0 92.92563 0.0
ENSG00000200502.1 0 0 0 0 92.92563 0.0
ENSG00000206881.1 0 0 0 0 92.92563 0.0
ENSG00000228550.1 0 0 0 0 92.92563 0.0
ENSG00000253302.1 0 0 0 0 92.92563 0.0
DESeq2 assumes a negative normal distribution, so the data was checked to assess where this was a valid assumption. The dispersion plot shows that variances exceed the means, satisfying the requirement for accepting the negative binomial distribution assumption.
The gseaplot function in clusterProfile has font settings which are too big for R markdown files. A defect ticket was raised in the enrichplot repository in GitHub(https://github.com/YuLab-SMU/enrichplot/issues/129) and a local wrapper function gseaplot_local which replicated the gseaplot function but added configurable font settings to enable printing multiple GSEA plots side-by-side.
The code for this is in
Once this defect is addressed, the local wrapper function can be replaced by the original gseaplot function call.
The details below detail the relevant system version details used to produce this analysis.
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] enrichplot_1.12.2 ggpubr_0.4.0 clusterProfiler_4.0.2
## [4] msigdbr_7.4.1 pheatmap_1.0.12 EnhancedVolcano_1.10.0
## [7] here_1.0.1 DT_0.18 stringr_1.4.0
## [10] data.table_1.14.0 tibble_3.1.3 dplyr_1.0.7
## [13] BiocParallel_1.26.1 gridExtra_2.3 plotly_4.9.4.1
## [16] ggrepel_0.9.1 ggplot2_3.3.5 DESeq2_1.32.0
## [19] SummarizedExperiment_1.22.0 Biobase_2.52.0 MatrixGenerics_1.4.0
## [22] matrixStats_0.59.0 GenomicRanges_1.44.0 GenomeInfoDb_1.28.1
## [25] IRanges_2.26.0 S4Vectors_0.30.0 BiocGenerics_0.38.0
##
## loaded via a namespace (and not attached):
## [1] readxl_1.3.1 backports_1.2.1 shadowtext_0.0.8 fastmatch_1.1-0
## [5] plyr_1.8.6 igraph_1.2.6 lazyeval_0.2.2 splines_4.1.0
## [9] crosstalk_1.1.1 digest_0.6.27 htmltools_0.5.1.1 GOSemSim_2.18.0
## [13] viridis_0.6.1 GO.db_3.13.0 fansi_0.5.0 magrittr_2.0.1
## [17] memoise_2.0.0 openxlsx_4.2.4 Biostrings_2.60.1 annotate_1.70.0
## [21] graphlayouts_0.7.1 extrafont_0.17 extrafontdb_1.0 colorspace_2.0-2
## [25] blob_1.2.1 haven_2.4.1 xfun_0.24 crayon_1.4.1
## [29] RCurl_1.98-1.3 jsonlite_1.7.2 scatterpie_0.1.6 genefilter_1.74.0
## [33] survival_3.2-11 ape_5.5 glue_1.4.2 polyclip_1.10-0
## [37] gtable_0.3.0 zlibbioc_1.38.0 XVector_0.32.0 DelayedArray_0.18.0
## [41] proj4_1.0-10.1 car_3.0-11 Rttf2pt1_1.3.8 maps_3.3.0
## [45] abind_1.4-5 scales_1.1.1 DOSE_3.18.1 DBI_1.1.1
## [49] rstatix_0.7.0 Rcpp_1.0.7 viridisLite_0.4.0 xtable_1.8-4
## [53] tidytree_0.3.4 foreign_0.8-81 bit_4.0.4 htmlwidgets_1.5.3
## [57] httr_1.4.2 fgsea_1.18.0 RColorBrewer_1.1-2 ellipsis_0.3.2
## [61] pkgconfig_2.0.3 XML_3.99-0.6 farver_2.1.0 locfit_1.5-9.4
## [65] utf8_1.2.1 tidyselect_1.1.1 labeling_0.4.2 rlang_0.4.11
## [69] reshape2_1.4.4 AnnotationDbi_1.54.1 cellranger_1.1.0 munsell_0.5.0
## [73] tools_4.1.0 cachem_1.0.5 cli_3.0.0 downloader_0.4
## [77] generics_0.1.0 RSQLite_2.2.7 broom_0.7.8 evaluate_0.14
## [81] fastmap_1.1.0 yaml_2.2.1 ggtree_3.0.2 babelgene_21.4
## [85] knitr_1.33 bit64_4.0.5 tidygraph_1.2.0 zip_2.2.0
## [89] purrr_0.3.4 KEGGREST_1.32.0 ggraph_2.0.5 nlme_3.1-152
## [93] ash_1.0-15 ggrastr_0.2.3 aplot_0.0.6 DO.db_2.9
## [97] rstudioapi_0.13 compiler_4.1.0 curl_4.3.2 beeswarm_0.4.0
## [101] png_0.1-7 ggsignif_0.6.2 treeio_1.16.1 tweenr_1.0.2
## [105] geneplotter_1.70.0 stringi_1.7.3 highr_0.9 ggalt_0.4.0
## [109] forcats_0.5.1 lattice_0.20-44 Matrix_1.3-4 vctrs_0.3.8
## [113] pillar_1.6.1 lifecycle_1.0.0 BiocManager_1.30.16 cowplot_1.1.1
## [117] bitops_1.0-7 patchwork_1.1.1 qvalue_2.24.0 R6_2.5.0
## [121] rio_0.5.27 KernSmooth_2.23-20 vipor_0.4.5 MASS_7.3-54
## [125] assertthat_0.2.1 rprojroot_2.0.2 withr_2.4.2 GenomeInfoDbData_1.2.6
## [129] hms_1.1.0 grid_4.1.0 tidyr_1.1.3 rmarkdown_2.9
## [133] rvcheck_0.1.8 carData_3.0-4 ggforce_0.3.3 ggbeeswarm_0.6.0